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Abstract 

This paper compares a family of methods for characterizing neural feature selec- 
tivity with natural stimuli in the framework of the linear-nonlinear model. In this 
model, the neural firing rate is a nonlinear function of a small number of relevant 
stimulus components. The relevant stimulus dimensions can be found by max- 
imizing one of the family of objective functions, Renyi divergences of different 
orders [[1] |2l . We show that maximizing one of them, Renyi divergence of or- 
der 2, is equivalent to least-square fitting of the linear-nonlinear model to neural 
data. Next, we derive reconstruction errors in relevant dimensions found by max- 
imizing Renyi divergences of arbitrary order in the asymptotic limit of large spike 
numbers. We find that the smallest errors are obtained with Renyi divergence of 
order 1, also known as Kullback-Leibler divergence. This corresponds to finding 
relevant dimensions by maximizing mutual information |2|. We numerically test 
how these optimization schemes perform in the regime of low signal-to-noise ra- 
tio (small number of spikes and increasing neural noise) for model visual neurons. 
We find that optimization schemes based on either least square fitting or informa- 
tion maximization perform well even when number of spikes is small. Information 
maximization provides slightly, but significantly, better reconstructions than least 
square fitting. This makes the problem of finding relevant dimensions, together 
with the problem of lossy compression (]3], one of examples where information- 
theoretic measures are no more data limited than those derived from least squares. 



1 Introduction 

The application of system identification techniques to the study of sensory neural systems has a 
long history. One family of approaches employs the dimensionality reduction idea: while inputs 
are typically very high-dimensional, not all dimensions are equally important for eliciting a neural 
response |4]|5]|6l|7][8i. The aim is then to find a small set of dimensions {ii, 62, . . .} in the stimulus 
space that are relevant for neural response, without imposing, however, a particular functional de- 
pendence between the neural response and the stimulus components {si, S2, ■ ■ ■} along the relevant 
dimensions: 

P(spike|s) = P(spike)g(si, S2, sr), (1) 

If the inputs are Gaussian, the last requirement is not important, because relevant dimensions can be 
found without knowing a correct functional form for the nonlinear function g in Eq. ([T]i. However, 
for non-Gaussian inputs a wrong assumption for the form of the nonlinearity g will lead to systematic 
errors in the estimate of the relevant dimensions themselves [9, 5 , 1 , 2J. The larger the deviations of 
the stimulus distribution from a Gaussian, the larger will be the effect of errors in the presumed form 
of the nonlinearity function g on estimating the relevant dimensions. Because inputs derived from a 
natural environment, either visual or auditory, have been shown to be strongly non-Gaussian ifTOll . we 



will concentrate here on system identification methods suitable for either Gaussian or non-Gaussian 
stimuli. 

To find the relevant dimensions for neural responses probed with non-Gaussian inputs. Hunter and 
Korenberg proposed an iterative scheme |5 1 where the relevant dimensions are first found by assum- 
ing that the input-output function g is linear. Its functional form is then updated given the current 
estimate of the relevant dimensions. The inverse of g is then used to improve the estimate of the 
relevant dimensions. This procedure can be improved not to rely on inverting the nonlinear function 
g by formulating optimization problem exclusively with respect to relevant dimensions 1 1 , 2 1, where 
the nonlinear function g is taken into account in the objective function to be optimized. A family of 
objective functions suitable for finding relevant dimensions with natural stimuli have been proposed 
based on Renyi divergences |[T] between the the probability distributions of stimulus components 
along the candidate relevant dimensions computed with respect to all inputs and those associated 
with spikes. Here we show that the optimization problem based on the Renyi divergence of order 2 
corresponds to least square fitting of the linear-nonlinear model to neural spike trains. The Kullback- 
Leibler divergence also belongs to this family and is the Renyi divergence of order 1 . It quantifies 
the amount of mutual information between the neural response and the stimulus components along 
the relevant dimension |2|. The optimization scheme based on information maximization has been 
previously proposed and implemented on model |2 1 and real cells 1 11 1. Here we derive asymptotic 
errors for optimization strategies based on Renyi divergences of arbitrary order, and show that rele- 
vant dimensions found by maximizing KuUback-Leibler divergence have the smallest errors in the 
limit of large spike numbers compared to maximizing other Renyi divergences, including the one 
which implements least squares. We then show in numerical simulations on model cells that this 
trend persists even for very low spike numbers. 



2 Variance as an Objective Function 



One way of selecting a low-dimensional model of neural response is to minimize a x^-difference 
between spike probabilities measured and predicted by the model after averaging across all inputs s: 



F(spike|s) P(spike|s • v) 



P(spike) P(spike) ' ^ 

where dimension v is the relevant dimension for a given model described by Eq. ([T]l [multiple 
dimensions could also be used, see below]. Using the Bayes' rule and rearranging terms, we get: 

I n 2 



X'[v] =/dsP(s 



P(s|spikc) P(s • vjspike) 



P(s) 



P(s • v) 



^ ds 



[P(s|spike)]" 



dx 



[Pv(x|spike)]^ 



(3) 



In the last integral averaging has been carried out with respect to all stimulus components except 
for those along the trial direction v, so that integration variable a: = s ■ v. Probability distributions 
Pv{x) and Pv(x|spike) represent the result of this averaging across all presented stimuli and those 
that lead to a spike, respectively: 



J dsP{s)S{x — s ■ v), Pv(a:|spikc) = J dsP (s| spike) (5(a 



(4) 



where S{x) is a delta-function. In practice, both of the averages (01 are calculated by bining the 
range of projections values x and computing histograms normalized to unity. Note that if there 
multiple spikes are sometimes elicited, the probability distribution P(a;| spike) can be constructed 
by weighting the contribution from each stimulus according to the number of spikes it elicited. 

If neural spikes are indeed based on one relevant dimension, then this dimension will explain all of 
the variance, leading to = 0. For all other dimensions v, x^[v] > 0. Based on Eq. (O, in order 
to minimize x^ we need to maximize 



P[v] = / dxP^{x) 



Pv(a;|spike) 

Pv(x) 



(5) 



which is a Renyi divergence of order 2 between probability distribution {x \ spike) and Pv{x), and 
are part of a family of /-divergences measures that are based on a convex function of the ratio of 



the two probability distributions (instead of a power a in a Renyi divergence of order a) ITSlfTJl fn. 
For optimization strategy based on Renyi divergences of order a, the relevant dimensions are found 
by maximizing: 

a - I J 

By comparison, when the relevant dimension(s) are found by maximizing information the goal 
is to maximize Kullback-Leibler divergence, which can be obtained by taking a formal limit a — > 1: 

rr 1 /" , n / \ -Pv(a:^|spike) , Pv(a;|spike) /" , „ / i •, \ , f'vfa^lspike) 
/ V = / dxP^{x) ^ ' ^ ' \n ^ ' ^ = / 2; spike In ^ ' ' . (7) 

J Pv(x) Pv{x) J P^{x) 



Pv(a;|spike) 



(6) 



Returning to the variance optimization, the maximal value for F[v] that can be achieved by any 
dimension v is: 



= / ds 



[P(s|spike)]' 



(8) 



It corresponds to the variance in the firing rate averaged across different inputs (see Eq. (|9]l below). 
Computation of the mutual information carried by the individual spike about the stimulus relies on 
similar integrals. Following the procedure outlined for computing mutual information IJ4J . one can 
use the Bayes' rule and the ergodic assumption to compute P,nax as a time-average: 



T 



1 dt 






f 



(9) 



where the firing rate r{t) — P(spike|s)/At is measured in time bins of width Ai using multiple 
repetitions of the same stimulus sequence . The stimulus ensemble should be diverse enough to 
justify the ergodic assumption [this could be checked by computing F^ax for increasing fractions of 
the overall dataset size]. The average firing rate f = P(spike)/At is obtained by averaging r{t) in 
time. 

The fact that F[v] < Fmax can be seen either by simply noting that x^W\ — 0' or from the data 
processing inequality, which applies not only to Kullback-Leibler divergence, but also to Renyi 
divergences lfT2l[T3l [l1. In other words, the variance in the firing rate explained by a given dimension 
F[v] cannot be greater than the overall variance in the firing rate Pmax- This is because we have 
averaged over all of the variations in the firing rate that correspond to inputs with the same projection 
value on the dimension v and differ only in projections onto other dimensions. 

Optimization scheme based on Renyi divergences of different orders have very similar structure. In 
particular, gradient could be evaluated in a similar way: 



da;Pv(2;|spike) [(six, spike) — (six)] -7— 

ax 



Pv(x|spike) Y 

PA^) ) 



(10) 



where (s I spike) — /(iss(5(a; — s-v)P(s|spike)/P(a;|spike), andsimilarlyfor (s|a;). The gradient 
is thus given by a weighted sum of spike-triggered averages (s|x, spike) — (s|x) conditional upon 
projection values of stimuli onto the dimension v for which the gradient of information is being 
evaluated. The similarity of the structure of both the objective functions and their gradients for 
different Renyi divergences means that numeric algorithms can be used for optimization of Renyi 
divergences of different orders. Examples of possible algorithms have been described |[T]|2][TT] and 
include a combination of gradient ascent and simulated anneahng. 

Here are a few facts common to this family of optimization schemes. First, as was proved in the case 
of information maximization based on Kullback-Leibler divergence |i2J, the merit function P'"^ [v] 
does not change with the length of the vector v. Therefore v • VvP = 0, as can also be seen directly 
from Eq. ( flOl i. because v • (s|a;, spike) = x and v • (s|a;) — x. Second, the gradient is when 
evaluated along the true receptive field. This is because for the true relevant dimension according 
to which spikes were generated, (s|si, spike) — (s|si), a consequence of the fact that relevant 
projections completely determine the spike probability. Third, merit functions, including variance 
and information, can be computed with respect to multiple dimensions by keeping track of stimulus 



projections on all the relevant dimensions when forming probability distributions (|4|i. For example, 
in the case of two dimensions vi and V2, we would use 

Pvi,v2(a;i,a^2|spike) = J ds (5(xi — s • vi)5(x2 — s • V2)P(s|spike), 

-Pvi,v2(a;i,a;2) = J ds6{xi-s-vi)6{x2-s-V2)P{s), (11) 

to compute the variance with respect to the two dimensions as F[vi,V2] = 
/ dxidx2 [P(xi, a;2|spike)]^ /P{xi,X2)- 

If multiple stimulus dimensions are relevant for ehciting the neural response, they can always be 
found (provided sufficient number of responses have been recorded) by optimizing the variance 
according to Eq. (fTTT i with the correct number of dimensions. In practice this involves finding 
a single relevant dimension first, and then iteratively increasing the number of relevant dimensions 
considered while adjusting the previously found relevant dimensions. The amount by which relevant 
dimensions need to be adjusted is proportional to the contribution of subsequent relevant dimensions 
to neural spiking (the corresponding expression has the same functional form as that for relevant 
dimensions found by maximizing information, cf. Appendix B |2 |). If stimuli are either uncorrelated 
or correlated but Gaussian, then the previously found dimensions do not need to be adjusted when 
additional dimensions are introduced. All of the relevant dimensions can be found one by one, by 
always searching only for a single relevant dimension in the subspace orthogonal to the relevant 
dimensions already found. 



3 Illustration for a model simple cell 



Here we illustrate how relevant dimensions can be found by maximizing variance (equivalent to least 
square fitting), and compare this scheme with that of finding relevant dimensions by maximizing 
information, as well as with those that are based upon computing the spike-triggered average. Our 
goal is to reconstruct relevant dimensions of neurons probed with inputs of arbitrary statistics. We 
used stimuli derived from a natural visual environment [ 11 1 that are known to strongly deviate from 
a Gaussian distribution. All of the studies have been carried out with respect to model neurons. 
Advantage of doing so is that the relevant dimensions are known. The example model neuron is 
taken to mimic properties of simple cells found in the primary visual cortex. It has a single relevant 
dimension, which we will denote as ei. As can be seen in Fig. \lla), it is phase and orientation 
sensitive. In this model, a given stimulus s leads to a spike if the projection si = s ■ e\ reaches a 
threshold value 6* in the presence of noise: P(spikc|s)/P(spikc) = g(s\) = {H{si— 6 + £_)), where 
a Gaussian random variable ^ with variance ct^ models additive noise, and the function H{x) = 1 
for X > 0, and zero otherwise. The parameters 6 for threshold and the noise variance determine 
the input-output function. In what follows we will measure these parameters in units of the standard 
deviation of stimulus projections along the relevant dimension. In these units, the signal-to-noise 
ratio is given by a. 

Figure [T] shows that it is possible to obtain a good estimate of the relevant dimension ei by maxi- 
mizing either information, as shown in panel (b), or variance, as shown in panel(c). The final value 
of the projection depends on the size of the dataset, as will be discussed below. In the example 
shown in Fig.[T]there were « 50, 000 spikes with average probability of spike « 0.05 per frame, and 
the reconstructed vector has a projection Vmax ■ ei — 0.98 when maximizing either information or 
variance. Having estimated the relevant dimension, one can proceed to sample the nonlinear input- 
output function. This is done by constructing histograms for P(s • Wmax) and P(s ■ Umax | spike) of 
projections onto vector Wmax found by maximizing either information or variance, and taking their 
ratio. Because of the Bayes' rule, this yields the nonlinear input-output function g of Eq. ([T]i. In 
Fig- Ed) the spike probability of the reconstructed neuron P(spikc|s ■ Vmax) (crosses) is compared 
with the probability P(spike|si) used in the model (solid line). A good match is obtained. 

In actuality, reconstructing even just one relevant dimension from neural responses to correlated 
non-Gaussian inputs, such as those derived from real-world, is not an easy problem. This fact can 
be appreciated by considering the estimates of relevant dimension obtained from the spike-triggered 
average (STA) shown in panel (e). Correcting the STA by second-order correlations of the input 
ensemble through a multiplication by the inverse covariance matrix results in a very noisy estimate. 
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Figure 1: Analysis of a model visual neuron with one relevant dimension shown in (a). Panels (b) 
and (c) show normalized vectors {'max found by maximizing information and variance, respectively; 
(d) The probability of a spike P(spike|s ■ Vmax) (blue crosses - information maximization, red 
crosses - variance maximization) is compared to P(spike|si) used in generating spikes (solid line). 
Parameters of the model are a = 0.5 and 6* = 2, both given in units of standard deviation of si, 
which is also the units for the x-axis in panels (d and h). The spike-triggered average (STA) is shown 
in (e). An attempt to remove correlations according to the reverse correlation method, C~p^^^^^Vsta. 
(decorrelated STA), is shown in panel (f) and in panel (g) with regularization (see text). In panel 
(h), the spike probabilities as a function of stimulus projections onto the dimensions obtained as 
decorrelated STA (blue crosses) and regularized decorrelated STA (red crosses) are compared to a 
spike probability used to generate spikes (solid line). 

shown in panel (f). It has a projection value of 0.25. Attempt to regularize the inverse of covariance 
matrix results in a closer match to the true relevant dimension ||T5llT6l[T7l[T8llT9l and has a projection 
value of 0.8, as shown in panel (g). While it appears to be less noisy, the regularized decorrelated 
STA can have systematic deviations from the true relevant dimensions |9, 20, 2, 11|. Preferred 
orientation is less susceptible to distortions than the preferred spatial frequency liT9l . In this case 
regularization was performed by setting aside 1/4 of the data as a test dataset, and choosing a cutoff 
on the eigenvalues of the input covariances matrix that would give the maximal information value 
on the test dataset 1161 |19l . 



4 Comparison of Performance with Finite Data 

In the limit of infinite data the relevant dimensions can be found by maximizing variance, informa- 
tion, or other objective functions 1 1 1. In a real experiment, with a dataset of finite size, the optimal 
vector found by any of the Renyi divergences v will deviate from the true relevant dimension ei. 
In this section we compare the robustness of optimization strategies based on Renyi divergences of 
various orders, including least squares fitting (a = 2) and information maximization (a = 1), as the 
dataset size decreases and/or neural noise increases. 

The deviation from the true relevant dimension Sv ^ v — ei arises because the probability distri- 
butions (|4]i are estimated from experimental histograms and differ from the distributions found in 
the limit of infinite data size. The effects of noise on the reconstruction can be characterized by 
taking the dot product between the relevant dimension and the optimal vector for a particular data 
sample: v ■ ei = 1 — 5<^v^, where both v and ei are normalized, and 6v is by definition orthogo- 
nal to ei. Assuming that the deviation Sv is small, we can use quadratic approximation to expand 
the objective function (obtained with finite data) near its maximum. This leads to an expression 
(5v = — [i/*^"^]^^VF*^"^, which relates deviation Sv to the gradient and Hessian of the objective 
function evaluated at the vector ei. Subscript (a) denotes the order of the Renyi divergence used 
as an objective function. Similarly to the case of optimizing information \2i, the Hessian of Renyi 



divergence of arbitrary order when evaluated along the optimal dimension ei is given by 
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(12) 



where Cy (x) = {{siSj\x) — {si\x){sj\x)) are covariance matrices of inputs sorted by their projec- 
tion X along the optimal dimension. 

When averaged over possible outcomes of N trials, the gradient is zero for the optimal direction. In 
other words, there is no specific direction towards which the deviations (5v are biased. Next, in order 
to measure the expected spread of optimal dimensions around the true one ei, we need to evaluate 

((Jv^) = Tr (VF("' VF'"^^) ^ , and therefore need to know the variance of the gradient 

of F averaged across different equivalent datasets. Assuming that the probability of generating a 



spike is independent for different bins, we find that (VF^'^"^ VP-""*) — i?|"V-^spiko, where 
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(13) 



Therefore an expected error in the reconstruction of the optimal filter by maximizing variance is 
inversely proportional to the number of spikes: 

^ J (14) 



V ■ ei 



1 



1 



1 



2-/Vspikc 

where we omitted superscripts for clarity. Tr' denotes the trace taken in the subspace orthogo- 
nal to the relevant dimension (deviations along the relevant dimension have no meaning [2], which 
mathematically manifests itself in dimension ei being an eigenvector of matrices H and B with the 
zero eigenvalue). Note that when a = 1, which corresponds to Kullback-Leibler divergence and 
information maximization, A = iJ"^^ = B"^^. The asymptotic errors in this case are completely 
determined by the trace of the Hessian of information, ((5v^) cx Tr' [A^^], reproducing the previ- 
ously published result for maximally informative dimensions |2|. Qualitatively, the expected error 
^ Z)/(2A'spikc) increases in proportion to the dimensionality D of inputs and decreases as more 
spikes are collected. This dependence is in common with expected errors of relevant dimensions 
found by maximizing information 121, as well as methods based on computing the spike-triggered 
average both for white noise iri|2T]|22l and correlated Gaussian inputs tZ]. 

Next we examine which of the Renyi divergences provides the smallest asymptotic error (fl4] l for 
estimating relevant dimensions. Representing the covariance matrix as Cy (x) = jik{x)jjk{x) 
(exact expression for matrices 7 will not be needed), we can express the Hessian matrix H and 
covariance matrix for the gradient B as averages with respect to probability distribution P(x|spike): 



B 



dxP{x\sj>ike)b{x)b^ (x), H ~ dx P {x\apike) a{x)b'^ (x) 



(15) 



where the gain function g{x) — P(x|spike)/P(x), and matrices hij{x) = a^ij{x)g' (x) [g{x)]°' 



and 



{b')/{aby 



> 



lij{x)g' (x) / g{x). Cauchy-Schwarz identity for scalar quantities states that, 
l/(a^), where the average is taken with respect to some probability distribution. 



A similar result can also be proven for matrices under a Tr operation as in Eq. (fl4] i. Applying the 
matrix- version of the Cauchy-Schwarz identity to Eq. (fl4l i. we find that the smallest error is obtained 
when 

Tr'[PiJ"2] = Tr'[A"^], with A = j dxP{x\spike)a{x)a^ (x), (16) 

Matrix A corresponds to the Hessian of the merit function for a = 1: A — iJ . Thus, among the 
various optimization strategies based on Renyi divergences, Kullback-Leibler divergence (a = 1) 
has the smallest asymptotic errors. The least square fitting corresponds to optimization based on 
Renyi divergence with a = 2, and is expected to have larger errors than optimization based on 
Kullback-Leibler divergence (a = 1) implementing information maximization. This result agrees 
with recent findings that Kullback-Leibler divergence is the best distortion measure for performing 
lossy compression [i3 |. 



Below we use numerical simulations with model cells to compare the performance of information 
(a — 1) and variance (a = 2) maximization strategies in the regime of relatively small numbers 



of spikes. We are interested in the range 0.1 < -D/iVspiko ^ 1, where the asymptotic resuhs do not 
necessarily apply. The results of simulations are shown in Fig.|2]as a function of I?/iVspikc, as well 
as with varying neural noise levels. To estimate sharper (less noisy) input/output functions with a = 
1.5, 1.0, 0.5, 0.25, we used larger number of bins (16, 21, 32, 64), respectively. Identical numerical 
algorithms, including the number of bins, were used for maximizing variance and information. The 
relevant dimension for each simulated spike train was obtained as an average of 4 jackknife estimates 
computed by setting aside 1/4 of the data as a test set. Results are shown after 1000 line optimizations 
(D = 900), and performance on the test set was checked after every line optimization. As can be 
seen, generally good reconstructions with projection values > 0.7 can be obtained by maximizing 
either information or variance, even in the severely undersampled regime D < iVgpikc- We find that 
reconstruction errors are comparable for both information and variance maximization strategies, and 
are better or equal (at very low spike numbers) than STA-based methods. Information maximization 
achieves significantly smaller errors than the least-square fitting, when we analyze results for all 
simulations for four different models cells and spike numbers {p < 10^^, paired t-test). 
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Figure 2: Projection of vector w,nax obtained by maximizing information (red filled symbols) or 
variance (blue open symbols) on the true relevant dimension ei is plotted as a function of ratio be- 
tween stimulus dimensionality D and the number of spikes iVspikc, with D = 900. Simulations were 
carried out for model visual neurons with one relevant dimension from Fig. [Ha) and the input-output 
function Eq.([T]i described by threshold 6 = 2.0 and noise standard deviation a — 1.5, 1.0, 0.5, 0.25 
for groups labeled A (A), B (y), C (O)' ™d D (□), respectively. The left panel also shows results 
obtained using spike-triggered average (STA, gray) and decorrelated STA (dSTA, black). In the right 
panel, we replot results for information and variance optimization together with those for regularized 
decorrelated STA (RdSTA, green open symbols). All error bars show standard deviations. 

5 Conclusions 

In this paper we compared accuracy of a family of optimization strategies for analyzing neural re- 
sponses to natural stimuli based on Renyi divergences. Finding relevant dimensions by maximizing 
one of the merit functions, Renyi divergence of order 2, corresponds to fitting the linear-nonlinear 
model in the least-square sense to neural spike trains. Advantage of this approach over standard least 
square fitting procedure is that it does not require the nonlinear gain function to be invertible. We 
derived errors expected for relevant dimensions computed by maximizing Renyi divergences of ar- 
bitrary order in the asymptotic regime of large spike numbers. The smallest errors were achieved not 
in the case of (nonlinear) least square fitting of the linear-nonlinear model to the neural spike trains 
(Renyi divergence of order 2), but with information maximization (based on Kullback-Leibler di- 
vergence). Numeric simulations on the performance of both information and variance maximization 
strategies showed that both algorithms performed well even when the number of spikes is very small. 
With small numbers of spikes, reconstructions based on information maximization had also slightly, 
but significantly, smaller errors those of least-square fitting. This makes the problem of finding rel- 
evant dimensions, together with the problem of lossy compression 1.23. .3J, one of examples where 



information-theoretic measures are no more data limited than those derived from least squares. It 
remains possible, however, that other merit functions based on non-polynomial divergence measures 
could provide even smaller reconstruction errors than information maximization. 
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